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Abstract 

Most genome assemblers construct point estimates, choosing only a single genome sequence from among many 
alternative hypotheses that are supported by the data. We present a Markov chain Monte Carlo approach to sequence 
assembly that instead generates distributions of assembly hypotheses with posterior probabilities, providing an explicit 
statistical framework for evaluating alternative hypotheses and assessing assembly uncertainty. We implement this 
approach in a prototype assembler, called Genome Assembly by Bayesian Inference (GABI), and illustrate its application to 
the bacteriophage <I>X174. Our sampling strategy achieves both good mixing and convergence on lllumina test data for 
(DX174, demonstrating the feasibility of our approach. We summarize the posterior distribution of assembly hypotheses 
generated by GABI as a majority-rule consensus assembly. Then we compare the posterior distribution to external 
assemblies of the same test data, and annotate those assemblies by assigning posterior probabilities to features that are in 
common with GABI's assembly graph. GABI is freely available under a GPL license from https://bitbucket.org/mhowison/ 
gabi. 
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Introduction 

Most current methods for de novo genome assembly generate 
point estimates of the genome sequence without explicit statistical 
information about confidence in this particular estimate, or its 
support relative to other alternative assemblies supported by the 
same sequence data [1]. Recendy, several probabilistic approaches 
have been proposed to quantify assembly certainty and address 
these limitations. Computing Genome Assembly Likelihood 
(CGAL) approximates the likelihood of an assembly given the 
sequence reads and a generative model learned from the data [2] ; 
Log Average Probability (LAP) approximates the likelihood under 
a similar model of sequence read generation [3]; and Assembly 
Likelihood Evaluation (ALE) uses an empirical Bayesian approach 
to estimate the posterior probability of a particular assembly (or 
components of that assembly) [4]. Historically, the idea of applying 
statistical criteria to the fragment assembly problem was first 
proposed by Myers [5], who formulated the problem as a 
maximum-likelihood search. 

Here we present an alternative Bayesian approach to approx- 
imating posterior probabilities, using Markov chain Monte Carlo 
(MCMC) sampling [6] to generate an entire posterior distribution 
of assembly hypotheses. An MCMC framework for sequence 
assembly addresses many of the general problems of accommo- 
dating assembly uncertainty, and provides a probabilistic under- 
standing of the support for an assembly as a whole, as well as for 
particular features of the assembly that the investigator has special 
interest in. For example, it may be that each individual assembly 



hypothesis has low overall confidence, but that some feature, such 
as the order of a set of genes or existence of a specific regulatory 
element, is consistent across all the assembly hypotheses and has 
high confidence. 

Design and Implementation 

To demonstrate the feasibility of the MCMC approach, we 
have implemented a prototype assembler called Genome Assem- 
bly by Bayesian Inference (GABI, https:/ /bitbucket.org/ 
mhowison/gabi). Because our approach is computationally 
expensive, our current tests are limited to the small but well- 
studied genome of the (DX174 phage (which has NCBI Reference 
Sequencec NC_001422.1). Our input data consist of 2,000 read- 
pairs drawn from lllumina (DX174 sample data (http://www. 
illumina.com/systems/miseq/scientific_data.ilmn). Aavailable as 
File SI, this subset of reads has maximum length 251 bp, an 
estimated mean insert size of 357 bp, and approximately 93 x 
coverage of the 0X174 genome. 

Conveniendy, our approach has technical parallels and similar 
conceptual advantages as the advances made over the past 15 
years in the application of Bayesian approaches to phylogenetic 
analysis [7-9]. As in phylogenetics and other fields, designing an 
MCMC sampling strategy that achieves good mixing and 
convergence poses a variety of challenges and opportunities. For 
sequence assembly, the challenge is to design a mechanism for 
proposing new assemblies, a means of calculating the likelihood of 
a proposed assembly under an explicit model of read generation, 
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and the specification of a prior probability distribution that 
describes some prior beliefs about the assembly before the 
sequence data were collected. To implement an MCMC sequence 
assembler, we made the following choices: 

Proposal mechanism 

We first build an assembly graph starting from a de Bruijn 
graph of the reads. Then we remove all tips (since the (DX174 
genome is circular) and merge all unambiguous paths into single 
nodes that are annotated by the sequence of merged A>mers. The 
resulting unresolved assembly graph (no longer de Bruijn) is a 
directed graph that consists only of bubbles and is a minimal 
representation of the variants that can be inferred from the 
sequenced data. Concatenating the sequences across the nodes in 
a particular path through this graph gives a possible assembly 
sequence. 

A particular assembly hypothesis is represented as a boolean 
vector, whose values indicate which of the enumerated edges in 
the assembly graph are in an active path (Figure 1). We start with a 
random vector and propose new assemblies by toggling a value at 
random in the current vector. However, when toggling, we enforce 
the constraint that an active path cannot take multiple branches 
through the same bubble, since this would not spell a contiguous 
sequence. Instead, turning on an edge in an alternate branch turns 
on the entire alternate branch and turns off the existing path 
through the bubble (Figure 2). In this way, we can propose 
alternate paths through a complex bubble in one move of the 
sampler, without having to split the path and wait for the alternate 
branch to randomly activate. Furthermore, we can repeat this 
perturbation process a specified number of times for each new 
sample. Using several perturbations per sample causes the 
proposal mechanism to take larger steps through the combinato- 
rial space of possible assemblies. 

In our prototype implementation, we only consider paths that 
visit an edge at most once. In the general case, however, the 
proposal mechanism should be extended to accommodate paths 
that revisit edges, in order to resolve repeat regions of a genome. 

Likelihood 

Because of the constraints described above, the active paths in 
every proposed assembly can be output as a FASTA file of 



concatenated node sequences. We run LAP directly on this 
FASTA file to estimate the proposed assembly's likelihood. 

Priors 

An uninformative (e.g., flat) prior distribution could be used if 
nothing is known about the genome to be assembled. Since 
<1>X174 is well known, and it has previously been established that 
the genome is 5,386 bp and consists of one circular chromosome, 
we construct a prior probability distribution as the product of two 
gamma distributions, one for the sum of contig lengths (centered at 
5,386) and the other for the number of contigs in the assembly 
(centered at 1). In other cases, external information about repeat 
structure or gene synteny could also be incorporated as priors. 

Results 

To evaluate the mixing and convergence of our MCMC 
sampler, we ran three independent chains and compared their 
traces and split frequencies, as is common practice for other 
applications of MCMC [10]. Our results show that our design 
achieved good mixing when using a simple Metropolis sampler 
[1 1] with three perturbations per assembly proposal (Figure 3a). 
The cumulative frequencies of the individual nodes and edges 
were mostly stabilized after 4,000 accepted samples (Figure 3b). 
Those with weak support were likely assembled from reads with 
sequencing errors. Other nodes and edges were more ambiguous, 
with a cumulative frequency hovering at an intermediate value or 
varying across samples. Ten nodes achieved a frequency of 100% 
(darkest red nodes in Figure 4a) and these include the four nodes 
representating the longest unambiguous sequences in the assembly 
graph (each longer than 400 bp), for a total of 2,428 bp of 
sequence. 

Overall, the standard deviation in edge frequencies between the 
chains decreased with additional sampling, indicating that the 
independent chains converged to the same posterior distribution 
(Figure 3c). The split frequencies among the chains were mostly 
correlated after the last sample (Figure 3d). We also tested the 
sampler with flat priors, with no likelihood calculations (priors 
only), and with different choices of the shape parameter for the 
gamma distributions (Figure 5). Our acceptance rate was 22.4%, 
which is close to the heuristic of 25% that can be considered 
optimal for general Metropolis sampling [12]. Aggregated across 
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Figure 1. Assembly representation. Contiguous paths through the assembly graph correspond to proposed assemblies, and are represented by 
a boolean vector indicating which edges in the graph are active. 
doi:1 0.1 371 /journal.pone.0099497.g001 
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Figure 2. Customized proposal mechanism, (a) Starting from an existing (blue) assembly, the proposal mechanism randomly chooses a new 
(green) edge, (b) If the edge is not already active, it is extended with a random walk until it meets the existing assembly, (c) This defines a (red) 
branch in the existing assembly, (d) which is then removed to yield a new (blue) assembly. 
doi:1 0.1 371 /journal.pone.0099497.g002 



all three chains, the mean compute time per sample was 1.2 
seconds and total compute time was 166.3 CPU-hours. 

GABI provides multiple ways for summarizing the results of an 
MCMC analysis and is built on top of BioLite, a light-weight 
bioinformatics framework with rich diagnostics and reporting 
capabilities [13]. The assembly graph can be annotated with the 
approximated posterior probabilities (Figure 4a), or a consensus 
assembly can be extracted, for instance a majority-rule consensus 
that shows all edges occurring with frequency > 50% (Figure 4b). 
The report also includes a FASTA file for the majority-rule 
consensus, annotated with the posterior probabilities of its 
components, and an interactive animation of the sampling for 
each MCMC chain, created with the D3js data visualization 
toolkit [14]. A sample report for the MCMC chains analyzed here 
is archived in the Brown Digital Repository [15]. 



GABI includes a tool to assign posterior probabilities to features 
of external assemblies that correspond to features in its own 
assembly graph. This provides an explicit and unified statistical 
framework for comparison of assemblies produced by multiple 
methods and software tools. Here, we compare NCBI Reference 
Sequence NC_00 1422.1 and de novo point-estimate assemblies 
generated by the String Graph Assembler [16] and Velvet [17]. 
We require exact matches to identify corresponding features, and 
this conservative strategy means that there is not an exact 
correspondence between the NCBI reference sequence and the 
GABI graph (Figure 4e). For this simple assembly problem, there is 
nearly universal agreement among the assemblies: both the 
majority rule consensus (Figure 4b) and NCBI reference sequence 
(Figure 4e) are proper subsets of the two de novo point-estimate 
assemblies (Figure 4c— d). The one notable difference is that the 
SGA assembly contains two contigs that choose alternate paths in 
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Figure 3. Evidence of good mixing and convergence of three independent MCMC assembly chains, (a) Early in the sampling, the 
log(likelihood) reaches a stationary distribution with random noise, indicating good mixing of the chains, (b) Plotting the cumulative node/edge 
frequencies shows that most of the frequencies have reached a stationary value, (c) The average standard deviation among the three chains of the 
cumulative frequencies approaches zero, (d) A bivariate plot comparing node/edge frequencies between each pair of chains shows that the 
frequencies are in agreement across all chains. Both (c) and (d) indicate convergence. 
doi:1 0.1 371 /journal.pone.0099497.g003 



one of the bubbles, and these alternate paths have similar posterior 
probabilities (green highlight in Figure 4c ). 

Discussion 

A challenge to scaling MCMC assembly is that the combina- 
torial complexity of larger assembly graphs could become 
prohibitive for full de novo Bayesian assembly of large genomes. 
There are several ways to address this, such as applying more 
sophisticated sampling methods like Metropolis-coupled MCMC 
[18] or bridging states [19] that can explore larger combinatorial 
spaces more efficiently; constraining the assembly graph to focus 
on particular assembly hypotheses instead of attempting full de 
novo assembly; partitioning a larger genome based on an existing 
draft assembly and performing MCMC assembly concurrently on 
subgraphs of less complexity; or pruning the assembly graph using 
additional data from restriction site mappings [20]. Another 
promising application is transcriptome assembly, since the 
assembly graphs for individual transcripts should be less complex 
and can be sampled independently. 

Our prototype assembler cannot yet consider repeats, as each 
edge can only be visited once. Repeats are common in many 
genomes, and addressing them is critical if this approach is to be 



widely adopted. Fortunately, a straight-forward extension of our 
graph representation from a vector space of booleans to a vector 
space of non-negative integers can represent repeats, where the 
integer value for an edge denotes how many times it is visited. The 
proposal mechanism can be modified to enforce constraints on the 
in- and out-degree of the nodes in the path, to ensure that a 
proposed path can be unwound into a valid, contiguous sequence. 
Because paired-end alignments are used to calculate the likelihood, 
a path with the correct number of repeats will score a higher 
likelihood. We expect this to work well because two maximum- 
likelihood assembly approaches have already used the same 
principle to demonstrate high accuracy in resolving repeat regions 
[21,22]. 

Like existing assemblers, GABI can be used to assemble a point 
estimate of a genome, but unlike most other assemblers, the 
resulting assembly will have been chosen according to explicit 
statistical criteria (posterior probability) and will have associated 
information on the confidence in various aspects of its sequence 
and structure. In addition, MCMC assembly provides new 
opportunities for investigators who are interested not only in the 
certainty of a particular inference (e.g. an ALE [4] estimate of 
posterior probability for a given assembly), but in the many 
alternative hypotheses that are also supported by the data. The 
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Figure 4. Graphical summaries and comparisons of the QX174 genome assemblies. Each node Is a sequence fragment, and alternative 
paths through the graph represent alternative assemblies. Node size is proportional to the log length of its sequence, (a) Posterior probabilities for 
each node/edge (dark red indicates a posterior of 1.0, grey a posterior of 0.0 ). (b-e) Components of the GABI assembly graph that are found exactly 
in the majority-rule consensus GABI assembly (b), SGA assembly (c), Velvet assembly (d), and NCBI reference sequence (e). 
doi:1 0.1 371 /journal.pone.0099497.g004 



MCMC approach addresses many of the general problems of 
summarizing assembly uncertainty and will allow assembly 
uncertainty to be propagated to downstream analyses [1], 

Methods 

The results presented were generated with GABI version 0.2.0 
and are recomputable using included scripts (https:/ /bitbucket. 
org/ mhowison/ gabi/ src/ master/ phix-test). Here, we provide 
brief comments on the technical details. 

Compute resources 

All tests were run at the Center for Computation and 
Visualization, Brown University, on IBM iDataPlex nodes, each 
equipped with 8-core, dual-socket Intel Xeon E5540 (2.53 Ghz) 
processors and 24 GB of memory. CPU-hours were calculated as 
total walltime across all nodes, multiplied by 8 CPUs per node. 



Subset of CDX174 sample data 

We started with 8.36 Gb of paired-end reads of maximum 
length 251 bp from Illumina 0X174 sample data (ftp:// 
webdata:webdata@ussd-ftp. illumina.com/Data/SequencmgRuns/ 
PhiX/PhiX_Sl_L001_Rl_001.fastq.gz and ftp://webdata:web 
data@ussd-ftp. illumina.com/Data/SequencingRuns/PhiX/PhiX_ 
Sl_L001_R2_001.fastq.gz, accessed 2013 Apr 26). Using the bl- 
filter-illumina tool from BioLite, we chose the first 2,000 read pairs 
that did not contain known Illumina adapter sequences and that 
had mean quality score greater than 37 (Phred-33 scoring). This 
procedure is available in the script https:/ /bitbucket.org/ 
mhowison/gabi/ src/ master/ phix-test/ 00-subset-data.sh. 

SGA and Velvet assemblies 

To assemble the <1>X174 subset with SGA, we followed the 
example provided for an assembly of E. coli from similar MiSeq 
reads (https:// github.com/jts/ sga/blob/ master/src/ examples/ 
sga-ecoli-miseq.sh). To assemble with Velvet, we used the included 
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Figure 5. The prior probabilities influence the posterior distribution. Running the sampler with no data so that the posterior probabilities 
are determined by the prior probabilities only (a), with flat priors (b), or with different values of the shape parameter k for the gamma distributions of 
genome length and number of contigs (c-e) all cause shifts in the posterior distribution of edge frequencies. The baseline for the change in 
frequency is Chain 1 from Figure 3 with A' = 1 0, and all chains here were initialized with the same random seed as Chain 1. 
doi:1 0.1 371 /journal.pone.0099497.g005 



VelvetOptimiser tool to sweep coverage and cutoff parameters for 
a k-mer size of 99. The estimated mean and standard deviation of 
the insert size reported earlier were obtained from VelvetOpti- 
miser's output. This procedure is available in the script https:// 
bitbucket.org/ mhowison/ gabi/ src/ master/ phix-test/ 0 1 -assemble, 
sh. 

De Bruijn graph reduction 

First, we recursively and completely trimmed all tips (edges with 
an incidence of one), because the OX 174 genome is circular (for a 



linear genome, a softer tip trimming threshold would be more 
appropriate). Next, we merged all simple paths through the graph, 
similar to the process of reducing an overlap graph to a string 
graph [23]. We annotated the new node with the accumulated 
overlap of the merged nodes' fc-mers, which accumulates one 
additional nucleotide for each merged node. Finally, we split the 
graph into weakly connected components and choose the largest 
one. A path of edges through the graph spells a contig by 
concatenating the annotations on the nodes in the path. 
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Odds ratio 

To determine whether to accept a proposed assembly hypoth- 
esis, we calculate the odds ratio [11] of the posterior probabilities 
of the old assembly, P(H\D), and the new perturbed assembly, 
P(H*\D), as: 



R 



P(H*\D) 
P(H\D) 



P(D\H*)P(H*) 

P(D) 
P(D\H)P(H) 

P(D) 



P(D\H*)P(H*) 
P(D\H)P(H) 



(1) 



Supporting Information 

File SI <1>X174 sequence data. A subset of 2,000 paired-end 
reads (maximum length 251 bp, estimated mean insert size 
357 bp) drawn from Illumina MiSeq sample data (http:/ /www. 
illumina.com/systems/miseq/scientific_data.ilmn ). 
(ZIP) 
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If this ratio is greater than 1 , the new assembly is automatically 
accepted. If it is less than 1 , it is accepted with probability R. The 
process is then repeated until there is a stationary distribution of 
assemblies in the sample, which occurs when the frequency of 
assembly attributes does not change with additional sampling. 
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